Abstract — The distributed representation of correlated multi-view
images is an important problem that arises in vision sensor
networks. This paper concentrates on the joint reconstruction 
problem where the distributively compressed correlated images 
are jointly decoded in order to improve the reconstruction 
quality of all the compressed images. We consider a scenario 
where the images captured at different viewpoints are encoded 
independently using common coding solutions (e.g., JPEG, H.264 
intra) with a balanced rate distribution among different cameras. 
A central decoder first estimates the underlying correlation model 
from the independently compressed images which will be used for 
the joint signal recovery. The joint reconstruction is then cast as a 
constrained convex optimization problem that reconstructs total- 
variation (TV) smooth images that comply with the estimated 
correlation model. At the same time, we add constraints that force 
the reconstructed images to be consistent with their compressed 
versions. We show by experiments that the proposed joint 
reconstruction scheme outperforms independent reconstruction 
in terms of image quality, for a given target bit rate. In addition, 
the decoding performance of our proposed algorithm compares 
advantageously to state-of-the-art distributed coding schemes 
based on disparity learning and on the DISCOVER codec.

Index Terms — Distributed compression, Joint reconstruction, 
Optimization, Multi-view images, Depth estimation. 



I. Introduction 

In recent years, vision sensor networks have been gaining
ever-increasing popularity, driven by the availability
of cheap semiconductor components. These systems usually
acquire multiple correlated images of the same 3D scene from
different viewpoints. Compression techniques should exploit this
correlation in order to efficiently represent the 3D scene
information. The distributed coding paradigm becomes particularly
attractive in such settings; it exploits the correlation between
images with low encoding complexity and minimal inter-sensor
communication, which directly translates into power savings in
sensor networks. In the distributed compression framework, a
central decoder jointly reconstructs the visual information from
the compressed images by exploiting the correlation between the
samples. This makes it possible to achieve a good rate-distortion
tradeoff in the representation of correlated multi-view images,
even if the encoding is performed independently.

The first information-theoretic results on distributed source
coding (DSC) appeared in the late seventies for the noiseless [2]
and noisy cases [3]. However, most results in distributed coding
have remained non-constructive for about three decades. Practical
DSC schemes have then been designed by establishing a
relation between the Slepian-Wolf theorem and channel coding
[4]. Based on the results in [4], several distributed coding
schemes for video and multi-view images have been proposed
in the literature [5], [6]. In such schemes, a feedback channel
is generally used for accurately controlling the Slepian-Wolf
coding rate. Unfortunately, this results in increased latency
and bandwidth usage due to the multiple requests from the
decoder. These schemes can thus hardly be used in real-time
applications. One solution to avoid the feedback channel is
to use a separate encoding rate control module to precisely
control the Slepian-Wolf coding rate [7]. The overall computational
complexity at the encoder then becomes non-negligible due
to this rate control module. In this paper, we build a distributed
coding scheme where the correlated compressed images are
directly transmitted to the joint decoder without implementing
any Slepian-Wolf coding; this avoids the need for complex
estimation of the statistical correlation and of the
coding rate at the encoder.

We consider a scenario where a set of cameras is distributed
in a 3D scene. In most practical deployments of such
systems, the images captured by the different cameras are
likely to be correlated. The captured images are encoded
independently using standard encoding solutions and are transmitted
to the central decoder. Here, we assume that the images are
compressed using balanced rate allocation, which shares
the transmission and computational costs equally among
the sensors. It thus avoids the need for a hierarchical
relationship among the sensors. The central decoder builds
a correlation model from the compressed images, which is
used to jointly decode the multi-view images. The joint
reconstruction is formulated as a convex optimization problem. It
reconstructs the multi-view images that are consistent with the
underlying correlation information and with the compressed
image information. While reconstructing the images, we also
effectively handle the occlusions that commonly arise in
multi-view imaging. We solve the joint reconstruction problem using
effective parallel proximal algorithms [8].

We evaluate the performance of our novel joint decoding
scheme on several multi-view datasets. Experimental results
demonstrate that the proposed distributed coding solution
improves on the rate-distortion performance of separate coding
by taking advantage of the inter-view correlation. We
show that the quality of the decoded images is quite balanced
for a given bit rate, as expected from a symmetric coding
solution. We observe that our scheme, at low bit rate, performs
close to the joint encoding solutions based on H.264, when
the block size used for motion compensation is set to 4 x 4.
Finally, we show that our framework outperforms state-of-the-art
distributed coding solutions based on disparity learning [9]
and on the DISCOVER codec [10] in terms of rate-distortion
performance. It thus provides an interesting alternative to
most classical DSC solutions [5], [6], [7], since it does not
require any statistical correlation information at the encoder.

Fig. 1. Schematic representation of our proposed framework. The images
$X_1$ and $X_2$ are correlated through displacement of scene objects due
to positioning of the cameras $C_1$ and $C_2$.

Only very few works in the literature address the distributed
compression problem without using a channel encoder or a
feedback channel. In [11], a distributed coding technique for
compressing multi-view images has been proposed, where
a joint decoder reconstructs the views from low resolution
images using super-resolution techniques. In more detail,
each sensor transmits a low resolution compressed version of
the original image to the decoder. At the decoder, these low
resolution images are registered with respect to a reference
image, where the image registration is performed by shape
analysis and image warping. The registered low resolution
images are then jointly processed to decode a high resolution
image using image super-resolution techniques. However, this
framework requires communication between the encoders in
order to facilitate the registration, e.g., the transmission of
feature points. Other works in super-resolution use multiple
compressed images that are fused for improved resolution [12].
Such techniques usually target reconstruction of a single high
resolution image from multiple compressed images.
Alternatively, techniques have been developed in [13], [14] to decode
a single high quality image from several encoded versions of
the same source image or video. This is achieved by solving
an optimization problem that enforces the final reconstructed
image to be consistent with all the compressed copies. Our
main target in this paper is to jointly improve the quality of
multiple compressed correlated (multi-view) images and not to
increase the spatial resolution of the compressed images or to
extract a single high quality image. More recently, Schenkel et
al. [15] have considered a distributed representation of image
pairs. In particular, they have proposed an optimization
framework to enhance the quality of the JPEG compressed images.
This work, however, considered an asymmetric scenario that
requires a reference image for joint decoding.

The rest of the paper is organized as follows. The joint
decoding algorithm along with the optimization framework for
joint reconstruction is described in Section II. In Section III,
we present the optimization algorithm based on proximal
splitting methods. In Section IV, we present the experimental
results for the joint reconstruction of pairs of images.
Section V describes the extension of our proposed framework
to decode multiple images along with the simulation results.
Finally, in Section VI, we draw some concluding remarks.

II. Joint Decoding of Image Pairs 

We consider the scenario illustrated in Fig. 1, where a pair
of cameras $C_1$ and $C_2$ project the 3D visual information on the
2D planes $X_1$ and $X_2$ (with resolution $N = N_1 \times N_2$),
respectively. The images $X_1$ and $X_2$ are compressed independently
using standard encoding solutions (e.g., JPEG, H.264 intra)
and are transmitted to a central decoder. The joint decoder
has access to the compressed versions of the correlated
images and its main objective is to improve the quality of
all the compressed views by exploiting the underlying
inter-view correlation. We first propose to estimate the correlation
between images from the decoded images $\tilde{I}_1$ and $\tilde{I}_2$, which
is effectively modeled by a dense depth image $D$. The joint
reconstruction stage then uses the depth information $D$ and
enhances the quality of the decoded images $\tilde{I}_1$ and $\tilde{I}_2$. Note
that one could solve a joint problem to estimate simultaneously
the correlation information $D$ and the improved images.
However, such a joint optimization problem would be hard
to solve because of its complex objective function. Therefore, we
propose to split the problem in two steps: (i) we estimate the
correlation information from the decoded images; and (ii) we
carry out joint reconstruction using the estimated correlation
information. These two steps are detailed in the rest of this
section.

A. Depth Estimation 

The first task is to estimate the correlation between images,
which is typically represented by a depth image. In general, the dense
depth information is estimated by matching the corresponding
pixels between images. Several algorithms have been proposed
in the literature to compute dense depth images. For more
details, we refer the reader to [16]. In this work, we estimate
a dense depth image from the compressed images in a
regularized energy minimization framework, where the energy $E$
is composed of a data term $E_d$ and a smoothness term $E_s$. A
dense depth image $D$ is obtained by minimizing the energy
function $E$ as

$$D = \arg\min_{D_c} E(D_c) = \arg\min_{D_c} \{E_d(D_c) + \lambda E_s(D_c)\}, \tag{1}$$

where $\lambda$ balances the importance of the data and smoothness
terms, and $D_c$ represents the candidate depth images. The
candidate depth values $D_c(m,n)$ for every pixel position
$(m,n)$ are discrete; they are constructed by uniformly sampling
the inverse depth in the range $[1/D_{max}, 1/D_{min}]$, where
$D_{min}$ and $D_{max}$ are the minimal and maximal depth values
in the scene, respectively [17].
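The construction of the candidate depth values can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the implementation used in our experiments: the function name `inverse_depth_candidates` and the parameter `num_labels` (the number of discrete depth labels) are hypothetical.

```python
from typing import List


def inverse_depth_candidates(d_min: float, d_max: float, num_labels: int) -> List[float]:
    """Return depth candidates whose inverses are uniformly spaced.

    The inverse depth 1/D is sampled uniformly in [1/d_max, 1/d_min],
    so the candidates are denser close to the cameras than far away.
    """
    if not (0.0 < d_min < d_max):
        raise ValueError("expected 0 < d_min < d_max")
    if num_labels < 2:
        raise ValueError("at least two labels are needed")
    inv_lo = 1.0 / d_max  # smallest inverse depth (farthest point)
    inv_hi = 1.0 / d_min  # largest inverse depth (closest point)
    step = (inv_hi - inv_lo) / (num_labels - 1)
    # Candidates are returned from the farthest to the closest depth.
    return [1.0 / (inv_lo + k * step) for k in range(num_labels)]


if __name__ == "__main__":
    print(inverse_depth_candidates(1.0, 4.0, 4))
```

With `d_min = 1`, `d_max = 4` and four labels, the inverse depths are 0.25, 0.5, 0.75 and 1, so the spacing between successive depth candidates shrinks as the depth decreases.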

We now discuss in more detail the components of the
energy function of Eq. (1). The data term $E_d$ is used to match
the pixels across views by assuming that the 3D scene surfaces
are Lambertian, i.e., the intensity is consistent irrespective of
the viewpoint. It is computed as

$$E_d(D_c) = \sum_{m=1}^{N_1} \sum_{n=1}^{N_2} C((m,n), D_c(m,n)), \tag{2}$$

where $N_1$ and $N_2$ represent the image dimensions and $(m,n)$
represents a pixel position. The most commonly used pixel-based
cost functions $C$ include squared intensity differences
and absolute intensity differences. In this work, we use the squared
intensity difference to measure the disagreement of assigning
a depth value $D_c(m,n)$ to the pixel location $(m,n)$.
Mathematically, it is computed as

$$C((m,n), D_c(m,n)) = \|\tilde{I}_2(m,n) - W(\tilde{I}_1(m,n), D_c(m,n))\|_2^2, \tag{3}$$

where $W$ is a warping function that warps the image $\tilde{I}_1$ using
the depth value $D_c(m,n)$. This warping, in general, is a
two-step process [18]. First, the pixel position $(m,n)$ in the image
$\tilde{I}_1$ is projected to the world coordinate system. This projection
step is represented as

$$[u, v, w]^T = R_1 P_1^{-1} [m, n, 1]^T D_c(m,n) + T_1, \tag{4}$$

where $P_1$ is the intrinsic camera matrix of the camera $C_1$
and $(R_1, T_1)$ represent the extrinsic camera parameters with
respect to the global coordinate system. Then, the 3D point
$[u, v, w]^T$ is projected on the coordinates of the camera $C_2$
with the internal and external camera parameters given, respectively,
by $P_2$ and $(R_2, T_2)$. This projection step can be described as

$$[x', y', z']^T = P_2 R_2^{-1} \{[u, v, w]^T - T_2\}. \tag{5}$$

Finally, the pixel location in the warped image is taken as
$(m', n') = (\mathrm{round}(x'/z'), \mathrm{round}(y'/z'))$, where $\mathrm{round}(x)$
rounds $x$ to the nearest integer.

The smoothness term $E_s$ is used to enforce consistent depth
values at neighboring pixel locations $(m,n)$ and $(\bar{m},\bar{n})$. It is
measured as

$$E_s(D_c) = \sum_{(m,n),(\bar{m},\bar{n}) \in \mathcal{N}} \min(|D_c(m,n) - D_c(\bar{m},\bar{n})|, \tau), \tag{6}$$

where $\mathcal{N}$ represents the usual four-pixel neighborhood and
$\tau$ sets an upper level on the smoothness penalty such that
discontinuities can be preserved [19].
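The effect of the truncation level $\tau$ in Eq. (6) can be seen in the following minimal sketch. It assumes that the depth image is stored as a list of rows and that each pair of four-connected neighbors is counted once; the function name `smoothness_energy` is hypothetical.

```python
from typing import List, Sequence


def smoothness_energy(depth: Sequence[Sequence[float]], tau: float) -> float:
    """Truncated absolute-difference smoothness term on a 4-neighborhood.

    Each horizontal and vertical neighbor pair contributes
    min(|difference|, tau), so a large depth jump (an object boundary)
    is never penalized by more than tau.
    """
    rows, cols = len(depth), len(depth[0])
    total = 0.0
    for m in range(rows):
        for n in range(cols):
            if n + 1 < cols:  # right neighbor
                total += min(abs(depth[m][n] - depth[m][n + 1]), tau)
            if m + 1 < rows:  # bottom neighbor
                total += min(abs(depth[m][n] - depth[m + 1][n]), tau)
    return total


if __name__ == "__main__":
    step_edge: List[List[float]] = [[1, 1, 9], [1, 1, 9]]
    print(smoothness_energy(step_edge, tau=4))    # truncated penalty
    print(smoothness_energy(step_edge, tau=100))  # effectively untruncated
```

In this toy depth map the jump from 1 to 9 costs 4 per row with $\tau = 4$ instead of 8 without truncation, which is why sharp depth discontinuities remain affordable for the minimizer.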

We can finally rewrite the regularized energy objective
function for the depth estimation problem as

$$E(D_c) = \sum_{m=1}^{N_1} \sum_{n=1}^{N_2} C((m,n), D_c(m,n)) + \lambda \sum_{(m,n),(\bar{m},\bar{n}) \in \mathcal{N}} \min(|D_c(m,n) - D_c(\bar{m},\bar{n})|, \tau). \tag{7}$$

This cost function is used in the optimization problem of
Eq. (1), which is usually a non-convex problem. Several
minimization algorithms exist in the literature to solve Eq. (1),
e.g., Simulated Annealing [20], Belief Propagation [21], Graph
Cuts [22], [23]. Among these solutions, the optimization
techniques based on Graph Cuts compute the minimum energy
in polynomial time and they generally give better results than
the other techniques [16]. Motivated by this, in our work, we
solve the minimization problem of Eq. (1) using Graph Cut
techniques.

B. Image Warping as Linear Transformation 

Before describing our joint reconstruction problem, we
show how the image warping operation $W(\tilde{I}_1, D)$ in Eq. (3)
can be written as a matrix multiplication of the form $A \cdot \mathcal{R}(\tilde{I}_1)$
(see footnote 1); this linear representation offers a more flexible formulation
of our joint reconstruction problem. The reshaping operator
$\mathcal{R} : \mathbb{R}^{N_1 \times N_2} \to \mathbb{R}^{N_1 N_2 \times 1}$ produces a vector
$X = \mathcal{R}(I) = [I_{1,\cdot}^T \; I_{2,\cdot}^T \; \ldots \; I_{N_1,\cdot}^T]^T$ from the matrix $I$,
where $I_{m,\cdot}$ represents the $m$-th row of the matrix $I$ and $(\cdot)^T$ denotes
the usual transpose operator. For convenience, we also define another
operator $\mathcal{R}^{-1}_{N_1 \times N_2} : \mathbb{R}^{N_1 N_2 \times 1} \to \mathbb{R}^{N_1 \times N_2}$ that takes the vector
$X = [\mathcal{R}(I)]_{N_1 N_2 \times 1}$ and gives back the matrix $I_{N_1 \times N_2}$, i.e.,
this operator $\mathcal{R}^{-1}$ performs the inverse operation corresponding
to $\mathcal{R}$. The matrix $A$ describes the warping by re-arranging
the elements of $\mathcal{R}(\tilde{I}_1)$. Its construction is described in this
section.

We have shown earlier that the warping function $W$ shifts
the pixel position $(m,n)$ in the reference image to the
position $(m',n')$ in the target image. Alternatively, this pixel
shift between images can be represented using a horizontal
component $m^h$ and a vertical component $m^v$ of the motion
field as $(m',n') = (m + m^h(m,n), n + m^v(m,n))$. Note
that this motion field $(m^h, m^v)$ can be easily computed from
Eqs. (4) and (5), once the depth information $D$ and the camera
parameters are known. Now, our goal is to represent the motion
compensation operation $\tilde{I}_1(m + m^h(m,n), n + m^v(m,n))$ as
a linear transformation $A \cdot \mathcal{R}(\tilde{I}_1)$ given as

$$\mathcal{R}(\bar{I}_2) = \begin{bmatrix} \bar{I}_{2,1}^T \\ \bar{I}_{2,2}^T \\ \vdots \\ \bar{I}_{2,N_1}^T \end{bmatrix} = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_{N_1} \end{bmatrix} \mathcal{R}(\tilde{I}_1) = A \cdot \mathcal{R}(\tilde{I}_1). \tag{8}$$

Here, $\bar{I}_2 = W(\tilde{I}_1(m,n), D)$ represents the warped image, $\bar{I}_{2,m}$
is its $m$-th row, and $A_m$ is a matrix of dimensions $N_2 \times N_1 N_2$ whose
entries are determined by the horizontal and vertical components of the
motion field in the $m$-th row, i.e., $m^h(m,\cdot)$ and $m^v(m,\cdot)$.

1 For consistency, we use the compressed image $\tilde{I}_1$; however, this matrix
multiplication holds even if one uses the original image $X_1$ for warping.

In general, the elements of the matrix $A_m$ can be found in
two ways: (i) forward warping; and (ii) inverse warping. In
this work, we propose to construct the matrix $A_m$ based on
forward warping; this permits easier handling of the occluded
pixels as shown later. Given a motion vector, the elements of
the matrix $A_m$ are given by

$$A_m(n - \beta_1 - \beta_2 N_2, n) = \begin{cases} 1 & \text{if } m^h(m,n) = \beta_1 \text{ and } m^v(m,n) = \beta_2, \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$

If $n - \beta_1 - \beta_2 N_2 < 0$ (e.g., at image boundaries), we set
$n - \beta_1 - \beta_2 N_2 = 1$ so that the dimensions of the matrix $A_m$
stay $N_2 \times N_1 N_2$. It should be noted that the matrix $A_m$
formed using Eq. (9) can contain multiple entries with value
'1' in a row. This is because several pixels in the source
image can be mapped to the same location in the destination
image during forward warping. In such cases, for a given row
index we keep only the last '1' entry in the matrix $A_m$
while the remaining ones in the row are set to zero. This
is motivated by the fact that, during forward warping, when
multiple source pixels are mapped to the same destination
point $(m',n')$, the intensity value of the last source pixel is
assigned to the destination pixel $(m',n')$ (see footnote 2). Furthermore, it is
interesting to note that some of the rows in the matrix $A_m$
do not contain any entry with value '1', i.e., all entries in
such a row of $A_m$ are zeros. This means that the set of pixel
locations $\{j : j \in J_m\}$ in the warped image row $\bar{I}_{2,m}(j)$ has zero
value, where $J_m$ is the set of row indexes in the matrix $A_m$
that do not contain any entry with value '1'. These pixel
positions represent holes in the warped image that define the
occluded regions. Finally, the $m$-th row in the warped image
is represented as

$$\bar{I}_{2,m}(j) = \begin{cases} 0 & \text{if } j \in J_m, \\ \tilde{I}_1(k,n) & \text{if } A_m(j, (k-m)N_2 + n) = 1. \end{cases} \tag{10}$$

Thus, it is clear that the matrix $A_m$ shifts the pixels in $\tilde{I}_1$
by the corresponding motion vector $(m^h(m,\cdot), m^v(m,\cdot))$ in
order to form $\bar{I}_{2,m}$. In a similar way, we can construct the
matrix $A_m$, $\forall m \in \{1, 2, \ldots, N_1\}$, and thus we can represent
the image warping $W(\tilde{I}_1(m,n), D)$ as $A \cdot \mathcal{R}(\tilde{I}_1)$. Finally, note
that similar operations can also be performed with an inverse
mapping. For details related to the construction of the matrix
$A_m$ based on inverse warping, we refer the reader to [24, Ch.
6, p. 95].
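The forward-warping rule described in this section (last source pixel wins, unfilled positions become holes) can be illustrated on a single image row. The sketch below is a simplified illustration under explicit assumptions, not our implementation: the motion is purely horizontal and integer-valued, indices are 0-based, the displacement is added to the source index, out-of-range destinations are clamped to the nearest border, and the names `forward_warp_row` and `source_of` are hypothetical. The list `source_of` plays the role of the matrix $A_m$: entry `j` stores the column of the single '1' kept in row `j`, or `None` when the row is all zeros.

```python
from typing import List, Optional, Sequence, Tuple


def forward_warp_row(src_row: Sequence[float],
                     motion: Sequence[int]) -> Tuple[List[float], List[int]]:
    """Forward-warp one image row and report the holes.

    Pixels are scanned from left to right; when several source pixels
    land on the same destination, the last one scanned is kept.
    """
    width = len(src_row)
    source_of: List[Optional[int]] = [None] * width
    for n in range(width):
        j = n + motion[n]                # destination of source pixel n
        j = min(max(j, 0), width - 1)    # clamp at the image boundaries
        source_of[j] = n                 # a later pixel overwrites an earlier one
    warped = [src_row[s] if s is not None else 0 for s in source_of]
    holes = [j for j, s in enumerate(source_of) if s is None]
    return warped, holes


if __name__ == "__main__":
    warped, holes = forward_warp_row([10, 20, 30, 40], [0, 1, 1, 0])
    print(warped, holes)
```

In this example the source pixels 2 and 3 both land on destination 3, so the value 40 of the last one is kept, while destination 1 receives no pixel and is reported as a hole with value zero.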



C. Joint reconstruction 

We now discuss our novel joint reconstruction algorithm
that takes advantage of the estimated correlation information
given by the matrix $A$ (or $D$) in order to reconstruct the
images. We propose to reconstruct an image pair $(\hat{I}_1, \hat{I}_2)$ as a
solution to the following optimization problem:

$$\begin{aligned} (\hat{I}_1, \hat{I}_2) = \arg\min_{I_1, I_2} \ & \left( \|I_1\|_{TV} + \|I_2\|_{TV} \right) \\ \text{s.t.} \ & \|\mathcal{R}(I_1) - \mathcal{R}(\tilde{I}_1)\|_2 \leq \epsilon_1, \\ & \|\mathcal{R}(I_2) - \mathcal{R}(\tilde{I}_2)\|_2 \leq \epsilon_1, \\ & \|\mathcal{R}(I_2) - A \cdot \mathcal{R}(I_1)\|_2^2 \leq \epsilon_2. \end{aligned} \tag{11}$$

Here, $\tilde{I}_1$ and $\tilde{I}_2$ represent the decoded views (see Fig. 1) and
$\|\cdot\|_{TV}$ represents the total-variation (TV) norm. The first two
constraints of Eq. (11) force the reconstructed images $\hat{I}_1$ and
$\hat{I}_2$ to be close to the respective decoded images $\tilde{I}_1$ and $\tilde{I}_2$.
The last constraint encourages the reconstructed images to be
consistent with the correlation information represented by $A$,
i.e., the warped image $A \cdot \mathcal{R}(\hat{I}_1)$ should be consistent with
the image $\mathcal{R}(\hat{I}_2)$. Finally, the TV prior term ensures that the
reconstructed images $\hat{I}_1$ and $\hat{I}_2$ are smooth. In general, the
inclusion of prior knowledge brings an effective reduction in the
search space, which leads to efficient optimization solutions.
The optimization problem of Eq. (11) therefore reconstructs
a pair of TV smooth images that is consistent with both the
compressed images and the correlation information. In our
framework, we use the TV prior on the reconstructed images;
however, one could also use a sparsity prior that minimizes
the $\ell_1$ norm of the coefficients in a sparse representation of
the images [25], [26].

2 We assume that the pixels are scanned from left to right and then top to
bottom.

In the above formulation, we measure the
correlation consistency of all the pixels in the image $\mathcal{R}(I_2)$
and the warped image $A \cdot \mathcal{R}(I_1)$. This implicitly assumes that
every pixel is visible in both views, which is not true in multi-view
imaging scenarios, as there are often problems due to occlusions.
This indicates that we need to
consider only the pixels that appear in both views and
we need to ignore the holes in the warped image $A \cdot \mathcal{R}(I_1)$
while enforcing consistency between $\mathcal{R}(I_2)$ and $A \cdot \mathcal{R}(I_1)$. The
positions of holes in the warped image $A \cdot \mathcal{R}(I_1)$ correspond to
the row indexes in the matrix $A$ that do not contain any value
of '1', i.e., all entries in a given row are zero. Once these
rows are identified, we simply ignore their contribution while
we measure the correlation consistency between the images
$\mathcal{R}(I_2)$ and $A \cdot \mathcal{R}(I_1)$. More formally, let $J = \bigcup_{m=1}^{N_1} J_m$ be
the set of indexes of these rows. Let us denote by $M$ a diagonal
matrix that is formed as

$$M(j,j) = \begin{cases} 0 & \text{if } j \in J, \\ 1 & \text{otherwise,} \end{cases} \tag{12}$$

where $j \in \{1, 2, \ldots, N_1 N_2\}$. For effective occlusion
handling, the joint reconstruction problem of Eq. (11) can be
modified as

$$\begin{aligned} (\hat{I}_1, \hat{I}_2) = \arg\min_{I_1, I_2} \ & \left( \|I_1\|_{TV} + \|I_2\|_{TV} \right) \\ \text{s.t.} \ & \|\mathcal{R}(I_1) - \mathcal{R}(\tilde{I}_1)\|_2 \leq \epsilon_1, \\ & \|\mathcal{R}(I_2) - \mathcal{R}(\tilde{I}_2)\|_2 \leq \epsilon_1, \\ & \|M(\mathcal{R}(I_2) - A \cdot \mathcal{R}(I_1))\|_2^2 \leq \epsilon_2. \end{aligned} \tag{OPT-1}$$

Note that, by setting $M = \mathbb{1}$ (the identity matrix), we get the optimization problem
of Eq. (11) that considers the consistency of all the pixels in
$\mathcal{R}(I_2)$ and $A \cdot \mathcal{R}(I_1)$. We show later that the quality of the
reconstructed images is improved when our joint decoding
problem OPT-1 is solved with the matrix $M$ constructed
using Eq. (12). Finally, the depth estimation and the joint
reconstruction steps could be iterated several times. In our
experiments, however, we have not observed any significant
improvement in the quality of the reconstructed images by
repeating these two steps.
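The role of the diagonal mask $M$ in the last constraint of OPT-1 can be illustrated on vectorized images. This is a minimal sketch under stated assumptions: the warped image is assumed to be already available as a flat list, the holes are given as a list of 0-based positions, and the function name `masked_consistency` is hypothetical.

```python
from typing import Iterable, Sequence


def masked_consistency(target: Sequence[float],
                       warped: Sequence[float],
                       holes: Iterable[int]) -> float:
    """Squared l2 distance between two vectorized images, holes excluded.

    Multiplying the residual by a 0/1 diagonal mask is equivalent to
    dropping the residual entries at the hole positions.
    """
    hole_set = set(holes)
    return sum((t - w) ** 2
               for j, (t, w) in enumerate(zip(target, warped))
               if j not in hole_set)


if __name__ == "__main__":
    target = [10, 99, 21, 40]
    warped = [10, 0, 20, 40]   # position 1 is a hole filled with zero
    print(masked_consistency(target, warped, holes=[1]))
    print(masked_consistency(target, warped, holes=[]))
```

With the hole masked the residual energy is 1, whereas the unmasked measure is dominated by the meaningless zero at the hole position (9802 in this toy case), which would otherwise pull the reconstruction towards an artificial value in occluded regions.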

III. Optimization methodology 

We now propose a solution for the joint reconstruction
problem OPT-1. We first show that the optimization problem
is convex. Then, we propose an effective solution based on
proximal methods.

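Proposition 1: The OPT-1 optimization problem is convex.
Proof: Our objective is to show that all the functions in the
OPT-1 problem are convex. It is easy to check
that the functions $\|I_j\|_{TV}$ and $\|\mathcal{R}(I_j) - \mathcal{R}(\tilde{I}_j)\|_2$, $\forall j \in \{1, 2\}$,
are convex [27]. So, we have to show that the last constraint function
$\|M(\mathcal{R}(I_2) - A \cdot \mathcal{R}(I_1))\|_2^2$ is convex.

Let $g(I_1, I_2) = \|x_2 - \bar{A} x_1\|_2^2$, where $x_2 = M \cdot \mathcal{R}(I_2)$, $\bar{A} = MA$
and $x_1 = \mathcal{R}(I_1)$. The function $g$ can be represented as

$$g = x_2^T x_2 - x_2^T \bar{A} x_1 - x_1^T \bar{A}^T x_2 + x_1^T \bar{A}^T \bar{A} x_1.$$

The second derivative $\nabla^2 g$ of the function $g$ with respect to
$(x_1, x_2)$ is given as

$$\nabla^2 g = \begin{bmatrix} 2\bar{A}^T \bar{A} & -2\bar{A}^T \\ -2\bar{A} & 2\,\mathbb{1} \end{bmatrix} = 2 C^T C \succeq 0.$$

Here, $C = [\bar{A} \;\; -\mathbb{1}]$, where $\mathbb{1}$ represents the identity matrix,
and $2C^T C \succeq 0$ follows from $2x^T C^T C x = 2\|Cx\|_2^2 \geq 0$ for
any $x$. This means that the Hessian $\nabla^2 g$ is positive
semi-definite and thus $g$ is convex in $(x_1, x_2)$. Since $x_1$ and $x_2$
are linear functions of $I_1$ and $I_2$, $g(I_1, I_2)$ is convex as well. ■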
We now propose an optimization methodology to solve the
OPT-1 convex problem with proximal splitting methods [8].
For mathematical convenience, we rewrite OPT-1 as

$$\begin{aligned} \arg\min_{X} \ & \left\{ \|\mathcal{R}^{-1}(S_1 X)\|_{TV} + \|\mathcal{R}^{-1}(S_2 X)\|_{TV} \right\} \\ \text{s.t.} \ & \|S_1(Y - X)\|_2 \leq \epsilon_1, \ \|S_2(Y - X)\|_2 \leq \epsilon_1, \\ & \|BX\|_2^2 \leq \epsilon_2, \end{aligned} \tag{13}$$

where $X = [\mathcal{R}(I_1); \mathcal{R}(I_2)]$, $Y = [\mathcal{R}(\tilde{I}_1); \mathcal{R}(\tilde{I}_2)]$, $S_1 =
[\mathbb{1} \;\; 0]$, $S_2 = [0 \;\; \mathbb{1}]$, $B = [-MA \;\; M]$ and $\mathbb{1}$ represents the
identity matrix. Recall that $\mathcal{R}^{-1}_{N_1 \times N_2}$ (for simplicity we omit
the subscript in Eq. (13)) is the operator that outputs a matrix
of dimensions $N_1 \times N_2$ from a column vector of dimensions
$N = N_1 N_2$. The optimization problem of Eq. (13) can be
viewed as a special case of the general convex problem

$$\arg\min_{X \in \mathbb{R}^{2N}} \{f_1(X) + f_2(X) + f_3(X) + f_4(X) + f_5(X)\}, \tag{14}$$

where the functions $f_1, f_2, f_3, f_4, f_5 \in \Gamma_0(\mathbb{R}^{2N})$ [8]. $\Gamma_0(\mathbb{R}^{2N})$
is the class of lower semicontinuous convex functions from
$\mathbb{R}^{2N}$ to $(-\infty, +\infty]$ that are not infinity everywhere. For the
optimization problem given in Eq. (13), the functions in the
representation of Eq. (14) are

1) $f_1(X) = \|\mathcal{R}^{-1}(S_1 X)\|_{TV}$,

2) $f_2(X) = \|\mathcal{R}^{-1}(S_2 X)\|_{TV}$,

3) $f_3(X) = i_{c_1}(X) = \begin{cases} 0 & X \in c_1, \\ \infty & \text{otherwise,} \end{cases}$
i.e., $f_3(X)$ is the indicator function of the closed convex
set $c_1 = \{X : \|S_1(Y - X)\|_2 \leq \epsilon_1\}$,

4) $f_4(X) = i_{c_2}(X) = \begin{cases} 0 & X \in c_2, \\ \infty & \text{otherwise,} \end{cases}$
where $c_2 = \{X : \|S_2(Y - X)\|_2 \leq \epsilon_1\}$,

5) $f_5(X) = i_{c_3}(X) = \begin{cases} 0 & X \in c_3, \\ \infty & \text{otherwise,} \end{cases}$
where $c_3 = \{X : \|BX\|_2^2 \leq \epsilon_2\}$.

The solution to the problem of Eq. (14) can be estimated
by generating the recursive sequence $X^{(t+1)} = \mathrm{prox}_f(X^{(t)})$,
where the function $f$ is given as $f = \sum_{i=1}^{5} f_i$. The proximity
operator is defined as $\mathrm{prox}_f(X) = \arg\min_Z \{f(Z) +
\frac{1}{2}\|X - Z\|_2^2\}$. The main difficulty with these iterations is the
computation of the $\mathrm{prox}_f(X)$ operator. There is no closed
form expression to compute $\mathrm{prox}_f(X)$, especially when
the function $f$ is the cumulative sum of two or more functions.
In such cases, instead of computing $\mathrm{prox}_f(X)$ directly
for the combined function $f$, one can perform a sequence
of calculations involving separately the individual operators
$\mathrm{prox}_{f_i}(X)$, $\forall i \in \{1, \ldots, 5\}$. The algorithms in this class
are known as splitting methods [8], which lead to an easily
implementable algorithm.

We describe in more detail the methodology to compute
the prox for the functions $f_i$, $\forall i \in \{1, \ldots, 5\}$. For the function
$f_1(X) = \|\mathcal{R}^{-1}(S_1 X)\|_{TV}$, the $\mathrm{prox}_{f_1}(X)$ can be computed
using Chambolle's algorithm [28]. A similar approach can
be used to compute $\mathrm{prox}_{f_2}(X)$. The function $f_3$ can
be represented as $f_3 = F \circ G$, where $F = i_{d(\epsilon_1)}$ and
$G(X) = S_1 X - S_1 Y$. The set $d(\epsilon_1)$ represents the $\ell_2$-ball defined
as $d(\epsilon_1) = \{y \in \mathbb{R}^N : \|y\|_2 \leq \epsilon_1\}$. Then, the $\mathrm{prox}_{f_3}$ can be
computed using the following closed form expression [29]:

$$\mathrm{prox}_{f_3}(X) = \mathrm{prox}_{F \circ G}(X) = X + (S_1)^* (\mathrm{prox}_F - \mathbb{1})(G(X)), \tag{15}$$

where $(S_1)^*$ represents the conjugate transpose of $S_1$.
The $\mathrm{prox}_F(y)$ with $F = i_{d(\epsilon_1)}$ can be computed using radial
projection [8] as

$$\mathrm{prox}_F(y) = \begin{cases} y & \|y\|_2 \leq \epsilon_1, \\ \dfrac{\epsilon_1}{\|y\|_2}\, y & \text{otherwise.} \end{cases} \tag{16}$$
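Eqs. (15) and (16) can be illustrated on a single image block. The sketch below is an illustration under stated assumptions rather than our implementation: vectors are plain Python lists, and the function names `project_l2_ball` and `prox_ball_around` are hypothetical. Because $S_1$ only selects the first block of $X$, Eq. (15) reduces on that block to $y_1 + \mathrm{prox}_F(x_1 - y_1)$, where $x_1 = S_1 X$ and $y_1 = S_1 Y$, while the second block is left unchanged.

```python
import math
from typing import List, Sequence


def project_l2_ball(y: Sequence[float], radius: float) -> List[float]:
    """Radial projection of y onto the l2-ball of the given radius (Eq. (16))."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= radius:
        return list(y)            # already inside the ball
    scale = radius / norm
    return [scale * v for v in y]


def prox_ball_around(x: Sequence[float],
                     center: Sequence[float],
                     radius: float) -> List[float]:
    """Project x onto the ball {z : ||z - center||_2 <= radius}.

    This is Eq. (15) restricted to the block selected by S_1:
    x + (prox_F(x - center) - (x - center)) = center + prox_F(x - center).
    """
    diff = [a - b for a, b in zip(x, center)]
    proj = project_l2_ball(diff, radius)
    return [c + p for c, p in zip(center, proj)]


if __name__ == "__main__":
    print(project_l2_ball([3.0, 4.0], 1.0))
    print(prox_ball_around([4.0, 5.0], [1.0, 1.0], 1.0))
```

In the second call the candidate lies at distance 5 from the decoded block, so it is pulled back along the line joining the two points until it sits on the surface of the ball of radius 1.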



The prox for the function $f_4$ can also be computed using Eq. (15)
by setting $F = i_{d(\epsilon_1)}$ and $G(X) = S_2 X - S_2 Y$. Finally, the
function $f_5$ can be represented with $F = i_{d(\sqrt{\epsilon_2})}$ and an affine
operator $G_1(X) = BX$, i.e., $f_5 = F \circ G_1$. As the operator $B$
is not a tight frame, the $\mathrm{prox}_{f_5}$ can be computed using an
iterative scheme [29]. Let $\mu_t \in (0, 2/\gamma_2)$, and $\gamma_1$ and $\gamma_2$ be
the frame constants with $\gamma_1 \mathbb{1} \preceq BB^* \preceq \gamma_2 \mathbb{1}$. The $\mathrm{prox}_{f_5}$ can
be calculated iteratively [29] as



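$$u^{(t+1)} = \mu_t \left(\mathbb{1} - \mathrm{prox}_{\mu_t^{-1} F}\right)\left(\mu_t^{-1} u^{(t)} + G_1(p^{(t)})\right), \tag{17}$$

$$p^{(t+1)} = X - B^* u^{(t+1)}, \tag{18}$$

where $u^{(t)} \to u$ and $p^{(t)} \to \mathrm{prox}_{F \circ G_1} = \mathrm{prox}_{f_5} = X - B^* u$.
It has been shown that both $u^{(t)}$ and $p^{(t)}$ converge linearly and
the best convergence rate is attained when $\mu_t = 2/(\gamma_1 + \gamma_2)$.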
In our work, we use the parallel proximal algorithm
(PPXA) proposed by Combettes et al. [8] to solve Eq. (13),
as this algorithm can be easily implemented on multicore
architectures due to its parallel structure. The PPXA
algorithm starts with an initial solution and computes
the $\mathrm{prox}_{f_i}$, $\forall i \in \{1, \ldots, 5\}$, in each iteration; the
results are used to update the current solution $X^{(t)}$. The
iterative procedure for computing the prox of the functions
$f_i$, $\forall i \in \{1, \ldots, 5\}$, and the updating steps are repeated until
convergence is reached. The authors have shown that the
sequence $(X^{(t)})_{t \geq 1}$ generated by the PPXA algorithm is
guaranteed to converge to the solution of problems such as
the one given in Eq. (14).



IV. Experimental Results 

A. Setup 

We now study the performance of our distributed representation
scheme for the joint reconstruction of pairs of compressed
images. The experiments are carried out on six natural datasets,
namely Tsukuba (views center and right), Venus (views 2 and
6) [16], Plastic (views 1 and 2) [30], Flowergarden (frames
5 and 6), Breakdancers (views 0 and 2) and Ballet (views
3 and 4) [31]. The first four datasets have been captured by
a camera array where the different viewpoints correspond to
translating the camera along one of the image coordinate axes.
In such a scenario, the motion of objects due to the viewpoint
change is restricted to the horizontal direction with no motion
along the vertical direction. The depth estimation is thus a
one-dimensional search problem and the data cost function given
in Eq. (3) is modified accordingly.

We compress the images independently using an H.264 intra
coding scheme; this implementation is carried out using the
JM reference software version 18 [32]. The bit rate at the
encoder is varied by changing the quantization parameter (QP)
in the H.264 coding scheme. In our experiments, we use six
different QP values, namely 51, 48, 45, 42, 39 and 35, in
order to generate the rate-distortion (RD) plots. Also, we use
the same QP value while encoding the images $X_1$ and $X_2$,
in order to ensure balanced rate allocation among the different
cameras. We estimate a depth image from the decoded images
$\tilde{I}_1$ and $\tilde{I}_2$ by solving the regularized energy minimization
problem of Eq. (1) using the $\alpha$-expansion algorithm in Graph Cuts
[22]. Unless stated explicitly, we solve the OPT-1 optimization
problem with the matrix $M$ constructed using Eq. (12). The
smoothness parameters $(\lambda, \tau)$ of the depth estimation problem
of Eq. (7) and the $(\epsilon_1, \epsilon_2)$ parameters of the OPT-1 joint
reconstruction problem are given in Table I for all the six
datasets; these parameters are selected based on trial and error
experiments. The solution to the OPT-1 problem is computed
by running the PPXA algorithm for 100 iterations.

We report in this section the performance of the proposed
joint reconstruction scheme and highlight the benefit of
exploiting the inter-view correlation while decoding the images.
We then study the effect of compression on the quality of
the estimated depth images. Then, we analyze the importance
of the matrix $M$ that enforces correlation consistency only
on the corresponding pixels (i.e., the pixels that are not
occluded) on the quality of the reconstructed images. Finally,
we compare the rate-distortion performance of our scheme
w.r.t. state-of-the-art distributed coding solutions and joint
encoding algorithms.

TABLE I

The parameters $(\lambda, \tau)$ in Eq. (7) and $(\epsilon_1, \epsilon_2)$ in the OPT-1 problem
used in our experiments.

Dataset         $\lambda$    $\tau$    $\epsilon_1$    $\epsilon_2$
Tsukuba         190          4         3               2
Venus           220          4         1               2
Plastic         120          4         1               2
Flowergarden    170          3         1               1.25
Breakdancers    300          160       2               1
Ballet          290          160       1               2.2


B. Performance Analysis 

We first compare our joint reconstruction results with
a scheme where the images are reconstructed
independently. Fig. 2(a), Fig. 2(b) and Fig. 3 compare the overall
quality of the decoded images between the independent
(denoted as H.264 Intra) and the joint decoding solutions (denoted
as Proposed), respectively for the Venus, Flowergarden and
Breakdancers datasets. The x-axis represents the total number
of bits spent on encoding the images and the y-axis represents
the mean PSNR value of the reconstructed images $\hat{I}_1$ and $\hat{I}_2$.
From the plots, we see that the proposed joint reconstruction
scheme performs better than the independent reconstruction
scheme by a margin of about 0.7 dB, 0.95 dB and 0.3 dB
respectively for the three datasets. This confirms that the
proposed joint decoding framework is effective in exploiting
the inter-view correlation while reconstructing the images.
Similar experimental results have been observed on the other
datasets. When compared to the first two datasets, the gain due
to joint reconstruction for the Breakdancers dataset is smaller,
as confirmed in Fig. 3. It is well known that this dataset is
weakly correlated due to large camera spacing [31], hence the
gain provided by the joint decoding is small.

We then quantitatively compare the RD performance of the joint and the
independent coding schemes using the Bjontegaard metric [33]. In our
experiments, we use the first four points in the RD plot for the
computation in order to highlight the benefit in the low bit rate
region; this corresponds to the QP values 51, 48, 45 and 42. The
relative rate savings due to joint reconstruction for all six datasets
are given in the second column of Table II. From the values in Table II,
we see that the benefit of joint reconstruction depends on the
correlation among the images; in general, the higher the correlation,
the better the performance. For example, the Flowergarden dataset gives
22.8% rate savings on average compared to H.264 intra due to very high
correlation. On the other hand, the Breakdancers and Ballet datasets
only provide about 5% rate savings due to weak correlation, mainly
because of the large distances between the cameras. Though the gain is
small for these datasets, we show later that the performance of our
scheme competes with that of the joint encoding solutions based on
H.264 at low bit rates.
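
The average rate savings reported with the Bjontegaard metric can be reproduced in outline as follows. This is a minimal sketch that assumes the common formulation of the metric (a cubic fit of the log-rate as a function of PSNR for each curve, integrated over the PSNR range covered by both curves); the function names and the sample RD points are illustrative assumptions and are not values from our experiments.

```python
import math

def fit_cubic(xs, ys):
    """Coefficients c0..c3 of the cubic passing through four (x, y) points."""
    n = 4
    # Augmented Vandermonde system: sum_k c_k * x**k = y.
    a = [[x ** k for k in range(n)] + [y] for x, y in zip(xs, ys)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = a[r][n] - sum(a[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = s / a[r][r]
    return coef

def integral(coef, lo, hi):
    """Definite integral of sum_k coef[k] * x**k over [lo, hi]."""
    def antiderivative(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coef))
    return antiderivative(hi) - antiderivative(lo)

def bd_rate_savings(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average rate savings (%) of the test curve over the reference curve.

    Each curve must contain exactly four RD points with distinct PSNR values.
    """
    # Centre the PSNR axis to keep the Vandermonde system well conditioned.
    shift = (sum(psnr_ref) + sum(psnr_test)) / (len(psnr_ref) + len(psnr_test))
    x_ref = [p - shift for p in psnr_ref]
    x_test = [p - shift for p in psnr_test]
    c_ref = fit_cubic(x_ref, [math.log10(r) for r in rate_ref])
    c_test = fit_cubic(x_test, [math.log10(r) for r in rate_test])
    # Integrate both fits over the PSNR interval covered by both curves.
    lo = max(min(x_ref), min(x_test))
    hi = min(max(x_ref), max(x_test))
    avg_log_diff = (integral(c_test, lo, hi) - integral(c_ref, lo, hi)) / (hi - lo)
    return (1.0 - 10.0 ** avg_log_diff) * 100.0

# Hypothetical RD points: the test curve needs half the rate at equal PSNR.
psnr = [26.0, 28.0, 30.0, 32.0]
rate_intra = [0.10, 0.16, 0.25, 0.40]
rate_joint = [r / 2 for r in rate_intra]
savings = bd_rate_savings(rate_intra, psnr, rate_joint, psnr)
```

A positive value means that the test scheme needs fewer bits than the reference for the same quality, which is the sign convention of the rate savings in the tables; a negative value means that it needs more.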

We then carry out the same experiments in a scenario, where 
the images are jointly reconstructed using a correlation model 
Fig. 3. Comparison of the rate-distortion performance of the independent
and the joint decoding schemes as well as the H.264-based joint encoding
schemes for the Breakdancers dataset.

TABLE II

Rate savings with respect to the independent coding schemes based on
H.264 intra for the stereo images. The average rate savings (%) are
computed using the Bjontegaard metric [33] for the QP values 51, 48,
45 and 42.

Dataset         Proposed   Proposed: True depth   H.264: 4x4   H.264
Tsukuba         14.9       20.5                   15.8         42.4
Venus           15.9       21.3                   12.2         44.9
Plastic         10.3       11                     8.4          28.7
Flowergarden    22.8       29.3                   29.2         46.5
Breakdancers    5.8        6.6                    -6.5         8.9
Ballet          4.4        6.1                    -8.9         2.7


that is estimated from the original images. This scheme thus serves as
a benchmark for the joint reconstruction, since the correlation is
accurately known at the decoder. The corresponding results are denoted
as Proposed: True depth in Fig. 2. The corresponding rate savings
compared to the independent compression based on H.264 intra are given
in the third column of Table II. At low bit rates, in general, we see
that our scheme falls short of this upper-bound performance due to the
poor quality of the depth estimation from compressed images. For
example, in Fig. 2(b) (for the Flowergarden dataset) we see that at a
bit rate of 0.2 (i.e., QP = 51), the proposed scheme is below the upper
bound by around 0.5 dB. As a result, we see in Table II that the rate
savings are higher when the actual depth information is used for the
joint reconstruction than when the depth information is estimated from
compressed images. We show in Fig. 4(b) and Fig. 4(d) the inverse depth
images (i.e., disparity images) estimated from the decoded images
$\tilde{I}_1$ and $\tilde{I}_2$ that are encoded with QP = 51 (resp.
total bit rate = 0.08 bpp) and QP = 35 (resp. total bit rate =
0.98 bpp), respectively, for the Venus dataset. Comparing these
disparity images with the actual disparity information in Fig. 4(a), we
observe poor-quality disparity results for QP = 51. Quantitatively, the
errors in the disparity images are 43% and 12% for QP = 51 and QP = 35,
respectively, when measured as the percentage of pixels with an
absolute error greater than one. This confirms that the quantization
noise in the compressed images is not properly handled while estimating
the correlation information. Similar conclusions can be derived for the
Flowergarden dataset from Fig. 5, where, in general, the depth
information estimated from highly compressed images is not accurate.
Developing robust correlation estimation techniques to alleviate this
problem is the target of our future work. We finally see in Fig. 2 that
the reconstruction quality achieved with the correlation estimated from
compressed images converges to the upper-bound performance when the
rate increases or, equivalently, when the quality of the decoded images
$\tilde{I}_1$ and $\tilde{I}_2$ improves.
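
The disparity error measure used above (the percentage of pixels whose absolute disparity error is greater than one) can be written in a few lines. The sketch below is only illustrative: the function name and the toy disparity maps are assumptions, and images are represented as plain nested lists.

```python
def disparity_error_percentage(estimated, ground_truth, threshold=1.0):
    """Percentage of pixels whose absolute disparity error exceeds `threshold`."""
    bad = 0
    total = 0
    for est_row, gt_row in zip(estimated, ground_truth):
        for est, gt in zip(est_row, gt_row):
            total += 1
            # An error exactly equal to the threshold is not counted as bad.
            if abs(est - gt) > threshold:
                bad += 1
    return 100.0 * bad / total

# Toy 2x3 disparity maps (hypothetical values, in pixels).
ground_truth = [[4, 4, 5], [6, 6, 7]]
estimated = [[4, 5, 8], [6, 2, 7]]
error = disparity_error_percentage(estimated, ground_truth)
```

In this toy example two of the six pixels differ from the ground truth by more than one, so one third of the pixels would be marked in white in an error map such as Fig. 4(c).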

We now analyze the importance of the matrix M in the optimization
problem OPT-1, which allows the correlation consistency to be measured
only on the non-occluded pixels, i.e., the holes in the warped image
$A\,\mathcal{R}(\hat{I}_1)$ are ignored while measuring the correlation
consistency between the images $A\,\mathcal{R}(\hat{I}_1)$ and
$\mathcal{R}(\hat{I}_2)$. In order to highlight the benefit, we first
solve the OPT-1 joint reconstruction problem by setting M = 1 (i.e.,
the identity, so that no pixel is masked). The corresponding
reconstructed right
Fig. 4. Comparison of the depth image estimated from compressed images
with the actual depth information for the Venus dataset. (a)
Ground-truth disparity image $s/D_g$; (b) computed disparity image
$s/D$ at QP = 51; (c) disparity error at QP = 51. The pixels with an
absolute error greater than one are marked in white. The percentage of
white pixels is 43%. (d) Computed disparity image $s/D$ at QP = 35; (e)
disparity error at QP = 35. The percentage of white pixels is 12%. The
parameter s represents the product of the focal length and the baseline
distance between the cameras $C_1$ and $C_2$.
Fig. 5. Comparison of the depth image estimated from compressed images
with the actual depth information for the Flowergarden dataset. (a)
Ground-truth disparity image $s/D_g$; (b) computed disparity image
$s/D$ at QP = 51; (c) disparity error at QP = 51. The pixels with an
absolute error greater than one are marked in white. The percentage of
white pixels is 25.3%. (d) Computed disparity image $s/D$ at QP = 35;
(e) disparity error at QP = 35. The percentage of white pixels is 6.6%.
The parameter s represents the product of the focal length and the
baseline distance between the cameras $C_1$ and $C_2$.
Fig. 6. Importance of the matrix M in the OPT-1 optimization problem.
(a) Original right image; and (b) reconstructed right image obtained as
a solution of the OPT-1 problem when M = 1. (c) and (d) Reconstructed
right and left images, respectively, obtained as a solution of the
OPT-1 problem when the matrix M is constructed based on Eq. (12). The
PSNR values of the reconstructed images are: (b) 26.84 dB; (c)
30.01 dB; and (d) 29.97 dB. The experiments are carried out on the
Tsukuba stereo dataset, where the images are encoded with a QP value
of 42.



image $\hat{I}_2$ is shown in Fig. 6(b). Comparing it with the original
right view $\mathcal{I}_2$ in Fig. 6(a), we see that visual artifacts
are noticeable in the reconstructed right image $\hat{I}_2$. In
particular, we notice strong artifacts along the edges of the lamp
holder and in the face regions; this is mainly due to the improper
handling of the occluded pixels. Quantitatively, the PSNR of the
reconstructed image $\hat{I}_2$ is 26.84 dB (while that of the
reconstructed left view $\hat{I}_1$ is 29.95 dB). We then solve the
OPT-1 optimization problem with a matrix M constructed using Eq. (12).
The corresponding reconstructed right image $\hat{I}_2$ and left image
$\hat{I}_1$ are shown in Fig. 6(c) and Fig. 6(d), respectively. We no
longer see any annoying artifacts in the reconstructed image
$\hat{I}_2$, thanks to the effective handling of the occlusions via the
matrix M. Also, the quality of the two reconstructed images becomes
quite similar: the PSNR values for the right and left views are
30.01 dB and 29.97 dB, respectively.
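
The effect of the matrix M on the correlation consistency term can be illustrated on a toy one-dimensional example. The sketch below is not the construction of Eq. (12): it assumes, for illustration only, that the warping operator A is given as an index map (with None marking a hole) and that M simply zeroes the holes; all names and values are hypothetical.

```python
def warp(view1, source_index):
    """Apply a toy warping operator A: pixel k of the warped image copies
    view1[source_index[k]]; None marks a hole (no corresponding pixel)."""
    return [0.0 if s is None else view1[s] for s in source_index]

def consistency(view2, warped, mask):
    """Squared l2 norm of M (R(I2) - A R(I1)) for a diagonal 0/1 mask M."""
    return sum(m * (b - a) ** 2 for m, a, b in zip(mask, warped, view2))

view1 = [10.0, 20.0, 30.0, 40.0]   # vectorized left image (toy values)
view2 = [5.0, 10.0, 20.0, 30.0]    # vectorized right image (toy values)
source_index = [None, 0, 1, 2]     # one-pixel shift; pixel 0 is a hole

warped = warp(view1, source_index)
identity_mask = [1, 1, 1, 1]                                    # M = 1
occlusion_mask = [0 if s is None else 1 for s in source_index]  # holes ignored

cost_identity = consistency(view2, warped, identity_mask)  # penalizes the hole
cost_masked = consistency(view2, warped, occlusion_mask)   # hole is skipped
```

With M = 1 the hole contributes a spurious residual even though the two views agree on every visible pixel, which is the kind of mismatch that the optimization would otherwise try to remove by distorting the reconstructed image.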



We then compare the RD performance of our scheme to a distributed
coding solution (DSC) based on the LDPC encoding of DCT coefficients,
where the disparity field is estimated at the decoder using Expectation
Maximization (EM) algorithms [9]. The resulting RD performance is given
in Fig. 7(a) and Fig. 7(b) (denoted as Disparity learning) for the
Venus and Flowergarden datasets, respectively. In the DSC scheme, the
Wyner-Ziv image $\mathcal{I}_2$ is decoded with the JPEG-coded
reference image $\mathcal{I}_1$ as the side information. In order to
have a fair comparison between the proposed scheme and this DSC
scheme [9], we carry out our joint reconstruction experiments with
JPEG-compressed images. That is, instead of H.264 intra we now use JPEG
for independently compressing the images $\mathcal{I}_1$ and
$\mathcal{I}_2$. Then, from the JPEG-coded images $\tilde{I}_1$ and
$\tilde{I}_2$, we jointly reconstruct a pair of images $\hat{I}_1$ and
$\hat{I}_2$ using the methodology described in Section III. The
resulting RD performance of the proposed scheme is available in
Fig. 7(a) and
Fig. 7. Comparison of the rate-distortion performance of the independent
(JPEG) and the joint decoding schemes as well as the DSC scheme based on
disparity learning [9]: (a) Venus dataset; and (b) Flowergarden dataset.
In this plot, the independent compression of the images $\mathcal{I}_1$
and $\mathcal{I}_2$ is performed using the JPEG coding scheme.



Fig. 7(b), respectively, for the two datasets. We first notice that the
proposed joint reconstruction scheme improves the quality of the
compressed images; this is consistent with our earlier observations. We
further observe that the disparity learning scheme marginally improves
the quality of the compressed images only at low bit rates; however, it
fails to perform better than the JPEG coding scheme at high bit rates.
Also, we note that the DSC scheme in [9] requires a feedback channel in
order to accurately control the LDPC encoding rate, while our proposed
solution requires neither statistical correlation modeling at the
encoder nor a feedback channel; this clearly highlights the benefits of
the proposed solution.

For the sake of completeness, we finally compare the performance of our
scheme with the joint encoding solutions based on H.264. In particular,
the joint compression of views is carried out by setting the profile
ID = 128; this corresponds to the stereo profile [32]. In this profile,
one of the images (say $\mathcal{I}_1$) is encoded as an I-frame while
the remaining view (say $\mathcal{I}_2$) is encoded as a P-frame. We
consider two different settings in the H.264 motion estimation, which
is performed with either a variable macroblock size or a fixed
macroblock size of 4x4. The RD performance corresponding to both cases
(denoted as H.264 and H.264: 4x4 blocks, respectively) is available in
Fig. 2(a), Fig. 2(b) and Fig. 3 for the Venus, Flowergarden and
Breakdancers datasets, respectively. Also, we report in columns 4 and 5
of Table II the rate savings of the joint encoding scheme compared to
the H.264 intra scheme. First, it is interesting to note that for
rectified images (or when the camera motion is horizontal), our scheme
competes with the H.264 joint encoding performance when the block size
is set to 4x4. However, our scheme does not perform as well at high bit
rates due to the lack of texture encoding. In other words, our scheme
decodes the images by exploiting the geometrical correlation
information, while the visual information along the texture and edges
is not perfectly captured. However, for non-rectified images like the
Breakdancers dataset (see Fig. 3), we see that our scheme competes with
the joint encoding solutions based on H.264. Similar conclusions can be
derived for the Ballet dataset in Table II, where the proposed scheme
provides rate savings of 4.4% while H.264 saves only 2.7%. This is
because, when the images are not rectified, which is the case in the
Breakdancers and Ballet datasets, block-based motion compensation is
not an ideal model to capture the inter-view correlation. For the same
reason, we see in Fig. 3 that the H.264 joint encoding with 4x4 blocks
performs even worse than the H.264 intra coding scheme; this is
indicated by a negative sign in Table II.

V. Joint Reconstruction of Multiple Images
A. Optimization Problem 

So far, we have focused on the distributed representation of pairs of
images. Now, we describe the extension of our framework to datasets
with J correlated images $\mathcal{I}_1, \mathcal{I}_2, \ldots,
\mathcal{I}_J$ that are captured by the cameras $C_1, C_2, \ldots, C_J$
from different viewpoints. We further assume that the J cameras are
calibrated, and we denote the intrinsic camera matrices of the J
cameras as $P_1, P_2, \ldots, P_J$. Also, let $R_1, R_2, \ldots, R_J$
and $T_1, T_2, \ldots, T_J$ represent, respectively, the rotation and
the translation of the J cameras with respect to the global coordinate
system. Similarly to the stereo setup, the J correlated images
$\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_J$ are compressed
independently (e.g., H.264 intra or JPEG) with a balanced rate
allocation. The compressed visual information is transmitted to the
central decoder, where we jointly process all the J compressed views in
order to benefit from the inter-view correlation for improved
reconstruction quality. In particular, as in the stereo decoding
framework, we first estimate a depth image from the J decoded images
(resp. $\tilde{I}_1, \tilde{I}_2, \ldots, \tilde{I}_J$) and we use it
for joint signal recovery. The J reconstructed images are denoted as
$\hat{I}_1, \hat{I}_2, \ldots, \hat{I}_J$, respectively.

We propose to estimate the depth image from the J decoded images in a
regularized energy minimization framework as a tradeoff between a data
term $E_d$ and a smoothness term $E_s$. The depth image D is estimated
by minimizing the energy E that is represented as

$$ D = \arg\min_{D_c} E(D_c) = \arg\min_{D_c} \{ E_d(D_c) + \lambda E_s(D_c) \}, \tag{19} $$

where $D_c$ represents the candidate depth images. Note that this
formulation is similar to the one used in the stereo case.

The data term $E_d(D_c)$ in the multi-view setup should measure the
cost of assigning a depth image $D_c$ that is globally consistent with
all the compressed images. In the literature, many works address the
problem of finding a good multi-view data cost function with global
consistency, e.g., [34], [17], [35]. In this work, for the sake of
simplicity, we propose to compute the global photo consistency as the
cumulative sum of the data term $E_d(D_c)$ defined for the stereo case.
That is, the global photo consistency term is given as

$$ E_d(D_c) = \sum_{j=2}^{J} \sum_{m,n} \| \tilde{I}_j(m,n) - W_j(\tilde{I}_1(m,n), D_c(m,n)) \|_2^2, \tag{20} $$

where $W_j$ is the warping function that projects the intensity values
in view 1 to view j using the depth information $D_c$. As described
previously in Section III-A, this warping is a two-step process: the
pixels are first projected from view 1 to the global coordinate system,
and they are then projected to view j using the camera parameters
$P_j$, $R_j$ and $T_j$. The objective of the smoothness cost $E_s$ is
to enforce consistency in the depth solution. For a candidate depth
image $D_c$, the smoothness energy is computed as in the stereo case.
The minimization problem of Eq. (19) can then be solved using strong
optimization techniques (e.g., Graph Cuts) in order to estimate a depth
image D from the decoded images. Finally, we note that one could
estimate more accurate depth information by considering additional
energy terms in the energy model of Eq. (19) in order to properly
account for occlusions, global scene visibility, etc. More details are
available in the overview paper [34].
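
The cumulative data term can be sketched as follows for grayscale images. The sketch assumes that view 1 has already been warped to every other viewpoint with a given candidate depth (the warping itself, which depends on the camera parameters, is not reproduced here); the function name, the data layout and the toy values are hypothetical, and the smoothness term is left out.

```python
def multiview_data_term(decoded_views, warped_views):
    """Cumulative photo-consistency cost over the views j = 2..J.

    decoded_views[k] is the decoded image of view k + 1 (2-D list of
    intensities); warped_views[k] is view 1 warped to that viewpoint with
    the candidate depth. Index 0 (view 1 itself) is skipped.
    """
    cost = 0.0
    for decoded, warped in zip(decoded_views[1:], warped_views[1:]):
        for row_d, row_w in zip(decoded, warped):
            for d, w in zip(row_d, row_w):
                cost += (d - w) ** 2
    return cost

# Toy example with J = 3 views of size 1x3 (hypothetical intensities).
decoded_views = [[[1, 2, 3]], [[2, 3, 4]], [[0, 2, 3]]]
# View 1 warped to views 2 and 3 under two candidate depth images.
warped_candidate_a = [[[1, 2, 3]], [[2, 3, 5]], [[1, 2, 3]]]
warped_candidate_b = [[[1, 2, 3]], [[0, 0, 0]], [[0, 0, 0]]]

cost_a = multiview_data_term(decoded_views, warped_candidate_a)
cost_b = multiview_data_term(decoded_views, warped_candidate_b)
best = "a" if cost_a < cost_b else "b"
```

Candidate a explains both views 2 and 3 better than candidate b, so it has the lower data cost; in the full energy of Eq. (19) this cost would be balanced against the smoothness term before a depth image is selected.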

Now, we focus on the joint decoding problem, where we are interested in
the reconstruction of J correlated images from the compressed
information $\tilde{I}_1, \tilde{I}_2, \ldots, \tilde{I}_J$; this is
carried out by exploiting the correlation that is given in terms of the
depth information D or of the operator A derived from the depth D, as
described in Section III-B. In particular, we can represent the warping
operation $W_j(\hat{I}_1, D)$ as a matrix multiplication of the form
$A_j\,\mathcal{R}(\hat{I}_1)$, which represents an approximation of the
image at viewpoint j. We propose to jointly reconstruct the J
multi-view images as a solution to the following optimization problem:

$$
\begin{aligned}
\text{(OPT-2)} \quad (\hat{I}_1, \hat{I}_2, \ldots, \hat{I}_J) = \arg\min_{\hat{I}_1, \ldots, \hat{I}_J} \;& \sum_{j=1}^{J} \|\hat{I}_j\|_{TV} \\
\text{s.t.} \;& \|\mathcal{R}(\hat{I}_1) - \mathcal{R}(\tilde{I}_1)\|_2 \le \epsilon_1, \\
& \|\mathcal{R}(\hat{I}_2) - \mathcal{R}(\tilde{I}_2)\|_2 \le \epsilon_1, \;\ldots, \\
& \|\mathcal{R}(\hat{I}_J) - \mathcal{R}(\tilde{I}_J)\|_2 \le \epsilon_1, \\
& \sum_{j=2}^{J} \|M_j(\mathcal{R}(\hat{I}_j) - A_j\,\mathcal{R}(\hat{I}_1))\|_2^2 \le \epsilon_2,
\end{aligned}
$$

where $M_j$ (see Eq. (12)) is a diagonal matrix that is constructed
using a procedure similar to the one described in Section III-C; this
allows the correlation consistency to be measured only on those pixels
that are available in all the views. From the above equation, we see
that the proposed reconstruction algorithm estimates J TV-smooth images
that are consistent with both the compressed and the correlation
(depth) information. It is interesting to note that by setting J = 2 in
OPT-2, we get the stereo joint reconstruction problem OPT-1.
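
As a quick check of this reduction, substituting J = 2 into OPT-2 leaves two measurement consistency constraints and a single correlation consistency constraint. The display below only instantiates OPT-2 as stated above; the bound symbols $\epsilon_1$ and $\epsilon_2$ are written as in Table I, and identifying $M_2$ and $A_2$ with the stereo quantities M and A is our reading of the notation.

```latex
\begin{aligned}
(\hat{I}_1, \hat{I}_2) = \arg\min_{\hat{I}_1, \hat{I}_2} \;& \|\hat{I}_1\|_{TV} + \|\hat{I}_2\|_{TV} \\
\text{s.t.} \;& \|\mathcal{R}(\hat{I}_1) - \mathcal{R}(\tilde{I}_1)\|_2 \le \epsilon_1, \\
& \|\mathcal{R}(\hat{I}_2) - \mathcal{R}(\tilde{I}_2)\|_2 \le \epsilon_1, \\
& \|M_2(\mathcal{R}(\hat{I}_2) - A_2\,\mathcal{R}(\hat{I}_1))\|_2^2 \le \epsilon_2 .
\end{aligned}
```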

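Finally, using the convexity result derived for the stereo problem, it
is easy to check that the optimization problem OPT-2 is convex.
Therefore, our multi-view joint reconstruction problem OPT-2 can also
be solved using proximal splitting methods. We can rewrite the OPT-2
problem as

$$
\begin{aligned}
\arg\min_{X} \;& \sum_{j=1}^{J} \|\mathcal{R}^{-1}(S_j X)\|_{TV} \\
\text{s.t.} \;& \|S_1(Y - X)\|_2 \le \epsilon_1, \; \|S_2(Y - X)\|_2 \le \epsilon_1, \;\ldots, \\
& \|S_J(Y - X)\|_2 \le \epsilon_1, \; \|HX\|_2^2 \le \epsilon_2 .
\end{aligned}
\tag{21}
$$

Here, $X = [\mathcal{R}(\hat{I}_1); \mathcal{R}(\hat{I}_2); \cdots;
\mathcal{R}(\hat{I}_J)]$, $Y = [\mathcal{R}(\tilde{I}_1);
\mathcal{R}(\tilde{I}_2); \cdots; \mathcal{R}(\tilde{I}_J)]$,
$S_1 = [\mathbf{1} \; 0 \; \cdots \; 0]$, ...,
$S_J = [0 \; \cdots \; 0 \; \mathbf{1}]$, where $\mathbf{1}$ denotes
the identity block, and the matrix H is given as

$$
H = \begin{bmatrix}
-M_2 A_2 & M_2 & 0 & \cdots & 0 \\
-M_3 A_3 & 0 & M_3 & \cdots & 0 \\
\vdots & \vdots & & \ddots & \vdots \\
-M_J A_J & 0 & 0 & \cdots & M_J
\end{bmatrix}.
\tag{22}
$$

It can be noted that the above optimization problem is an extension of
the one described in Eq. (13), where the TV prior, measurement and
correlation consistency objectives are now applied to all the J images.
Therefore, the prox operators for the objective function and the
constraints of Eq. (21) can be computed as described in Section III.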
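
The block structure of H can be checked numerically on a small example. The sketch below builds H for J = 3 views with N = 2 pixels each, using dense nested lists, and verifies that $\|HX\|_2^2$ equals the sum of the per-view correlation consistency terms of OPT-2. The helper names and the toy matrices are assumptions made for illustration; a practical implementation would rely on sparse operators.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Dense matrix-matrix product."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def build_h(masks, warps):
    """Stack the block rows [-M_j A_j, 0, ..., M_j, ..., 0] for j = 2..J.

    masks[i] and warps[i] are the N x N matrices M_j and A_j of view j = i + 2.
    """
    n = len(masks[0])
    num_views = len(masks) + 1
    rows = []
    for i, (m, a) in enumerate(zip(masks, warps)):
        minus_ma = [[-v for v in row] for row in matmul(m, a)]
        for r in range(n):
            row = minus_ma[r] + [0.0] * (n * (num_views - 1))
            # Place row r of M_j in the column block of view j.
            start = n * (i + 1)
            row[start:start + n] = m[r]
            rows.append(row)
    return rows

def direct_cost(masks, warps, views):
    """Sum over j = 2..J of ||M_j (x_j - A_j x_1)||_2^2."""
    total = 0.0
    for m, a, x_j in zip(masks, warps, views[1:]):
        diff = [p - q for p, q in zip(x_j, matvec(a, views[0]))]
        total += sum(v * v for v in matvec(m, diff))
    return total

# Toy data: M_2 masks the second pixel, A_2 swaps the two pixels.
masks = [[[1, 0], [0, 0]], [[1, 0], [0, 1]]]
warps = [[[0, 1], [1, 0]], [[1, 0], [0, 1]]]
views = [[1, 2], [2, 5], [1, 4]]          # vectorized views 1, 2 and 3

h = build_h(masks, warps)
x = [v for view in views for v in view]   # stacked vector X
stacked_cost = sum(v * v for v in matvec(h, x))
```

Both ways of evaluating the correlation consistency give the same value, which is why a single constraint on HX is enough to couple all the views to view 1.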

B. Performance Evaluation 

We now evaluate the performance of the multi-view joint reconstruction
algorithm using five images (center, left, right, bottom and top views)
of the Tsukuba dataset [17], three views (views 0, 1 and 2) of the
Plastic dataset [30], three views (views 0, 2 and 4) of the
Breakdancers dataset and three views (views 3, 4 and 5) of the Ballet
dataset [31]. Similarly to the stereo setup, we independently encode
the multi-view images using H.264 intra by varying the QP values. At
the joint decoder, we estimate a depth image D from the compressed
images by solving Eq. (19) with parameters $(\lambda, \tau)$ =
(390, 4), (180, 4), (330, 180) and (300, 180), respectively, for the
four datasets. Then, using the estimated depth image D, we jointly
decode the multiple views as a solution to the problem OPT-2 with the
matrices $M_j$ constructed using Eq. (12). This problem is solved with
the parameters $(\epsilon_1, \epsilon_2)$ = (2.5, 7), (1, 3), (2.3, 2)
and (1.1, 4.3), respectively, for the four datasets. Finally, we run
the PPXA algorithm for 100 iterations in order to reconstruct the J
correlated images.
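
Each measurement consistency constraint of Eq. (21) restricts a reconstructed view to an l2 ball centered on its decoded version. Assuming that such a constraint is handled through the indicator function of its feasible set, as is usual in proximal splitting, the corresponding proximity operator is the Euclidean projection onto that ball. The sketch below shows this projection only; the function name and the toy vectors are hypothetical, and the operators actually used are those of Section III.

```python
import math

def project_onto_ball(x, center, radius):
    """Euclidean projection of x onto {z : ||z - center||_2 <= radius}."""
    diff = [a - c for a, c in zip(x, center)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= radius:
        return list(x)          # already feasible, nothing to do
    scale = radius / dist       # shrink the offset onto the ball surface
    return [c + scale * d for c, d in zip(center, diff)]

decoded = [0.0, 0.0]            # plays the role of R(decoded view)
estimate = [3.0, 4.0]           # current estimate of R(reconstructed view)
projected = project_onto_ball(estimate, decoded, radius=1.0)
```

An estimate that already satisfies the constraint is returned unchanged, while an infeasible one is moved along the line towards the decoded view until it reaches the boundary of the ball.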

We first compare our results with a stereo setup, where the depth
estimation and the joint reconstruction steps are carried out with
pairs of images. In more detail, we take $\mathcal{I}_1$ to be the
center image in Tsukuba, view 1 in Plastic, view 2 in Breakdancers and
view 4 in Ballet, respectively, and we perform joint decoding between
the image $\mathcal{I}_1$ and the rest of the
Fig. 8. Comparison of the rate-distortion performance of the
independent, proposed, DISCOVER [10] and H.264-based joint encoding
schemes: (a) Tsukuba dataset; and (b) Plastic dataset. The joint
reconstruction is performed with J = 5 and J = 3 views, respectively,
for the Tsukuba and Plastic datasets.




images by selecting different pairs of images independently (all pairs
include $\mathcal{I}_1$). For example, for the Tsukuba dataset, we
perform the depth estimation and the joint reconstruction steps for the
following pairs: (i) center and right views; (ii) center and left
views; (iii) center and top views; and (iv) center and bottom views.
After decoding all the images, we take the mean PSNR of all the
reconstructed images. Note that in this setup the center image is
reconstructed four times. For a fair comparison, we keep the
reconstructed image $\hat{I}_1$ that gives the highest PSNR with
respect to $\mathcal{I}_1$. The experiments are carried out in a
similar way for the other datasets, where we perform the joint
reconstruction of pairs of images and then compute the average PSNR of
the reconstructed images. The resulting RD performance is denoted as
Proposed: Stereo in Fig. 8(a), Fig. 8(b), Fig. 9(a) and Fig. 9(b) for
the different datasets. From Fig. 8 and Fig. 9, it is clear that the
proposed joint multi-view reconstruction scheme (denoted as Proposed:
Multiview) performs better than the algorithm where the images are
handled in pairs, which highlights the benefit of our proposed
solution. We also calculate the rate savings compared to H.264 intra
encoding; the results are tabulated in the second and third columns of
Table III. The rate savings are clearly higher in the multi-view setup
than in the stereo setup. Finally, we note that the proposed multi-view
joint decoding framework is a simple extension of the stereo image
reconstruction algorithm. Still, it shows experimentally that it is
beneficial to handle all the multi-view images simultaneously at the
decoder rather than decoding them in pairs. We believe that the
rate-distortion performance in the multi-view problem can be further
improved when the depth information is estimated more accurately. For
instance, this can be achieved by explicitly considering the visibility
and occlusion constraints in the depth estimation framework,
e.g., [17], E2J. We leave this topic as part of our future work.
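
The evaluation protocol of the stereo-pair baseline (keep the best of the repeated reconstructions of the common view, then average the PSNR over all views) can be sketched as follows. The sketch assumes 8-bit images stored as nested lists and averages the per-image PSNR values; the function names and the toy images are hypothetical.

```python
import math

def psnr(reference, reconstructed, peak=255.0):
    """PSNR in dB between two equally sized images given as 2-D lists."""
    squared_error = 0.0
    count = 0
    for ref_row, rec_row in zip(reference, reconstructed):
        for ref, rec in zip(ref_row, rec_row):
            squared_error += (ref - rec) ** 2
            count += 1
    mse = squared_error / count
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def pairwise_mean_psnr(center, center_candidates, others, others_reconstructed):
    """Mean PSNR of the stereo-pair setup: keep the best reconstruction of the
    common view, then average its PSNR with those of the other views."""
    best_center = max(psnr(center, c) for c in center_candidates)
    scores = [best_center]
    scores += [psnr(o, r) for o, r in zip(others, others_reconstructed)]
    return sum(scores) / len(scores)

# Toy 1x2 images: the common view is reconstructed twice, once per pair.
center = [[100, 100]]
center_candidates = [[[100, 110]], [[100, 104]]]
others = [[[50, 50]]]
others_reconstructed = [[[50, 54]]]

mean_quality = pairwise_mean_psnr(center, center_candidates,
                                  others, others_reconstructed)
```

Keeping the best reconstruction of the common view favors the stereo baseline, so the gap to the multi-view scheme measured in this way is, if anything, conservative.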

We then compare the RD performance of our multi-view joint decoding
algorithm to a state-of-the-art distributed coding scheme (DSC) based
on DISCOVER [10]. The DSC experiments are carried out in the following
settings. In the






TABLE III

Rate savings with respect to the independent coding schemes based on
H.264 intra for the multi-view problem. The rate savings (%) are
computed using the Bjontegaard metric [33] for the QP values 51, 48,
45 and 42.

Dataset         Proposed: Stereo   Proposed: Multiview   H.264: 4x4   H.264
Tsukuba         14.7               19.2                  20.3         77.8
Plastic         10.2               13.2                  11.5         45.5
Breakdancers    5.7                7.8                   -1.5         14.7
Ballet          4.2                6.6                   -2.3         9.2

Tsukuba dataset, we consider four views, namely the left, right, top
and bottom images, as the key frames, and the center view is considered
as the Wyner-Ziv frame. At the decoder, we generate the side
information by fusing two side information images that are generated by
motion-compensated interpolation: (i) from the left and right decoded
views; and (ii) from the top and bottom decoded views. This fusion step
is implemented using the algorithm proposed in [36]. For the other
datasets, we consider the two extreme views as the key frames and the
center view as the Wyner-Ziv frame. In this scenario, a side
information image is generated by motion-compensated interpolation from
the decoded key frames. The resulting rate-distortion performance is
available in Fig. 8 and Fig. 9 (denoted as DISCOVER). Comparing the
performance of the proposed scheme (denoted as Proposed: Multiview) and
the DISCOVER scheme, we see that our scheme outperforms the distributed
coding solution. Note that this is the case even for the Tsukuba
dataset, where four images are fused together to estimate the best
possible side information. Furthermore, we can see that the DSC scheme
based on DISCOVER actually performs worse (except for the Tsukuba
dataset) than the H.264 intra scheme where all the images are decoded
independently. This is mainly due to the poor quality of the side
information image generated by motion-compensated interpolation. In
other words, the linear motion assumption is not an ideal model for
capturing the correlation between images captured in multi-view camera
networks. Finally, it is interesting to note that our joint decoding
framework requires neither a Slepian-Wolf encoder nor a feedback
channel, while the DISCOVER coding scheme requires a feedback channel
to ensure successful decoding; this comes at the price of high latency
due to multiple requests from the decoder [10].

For the sake of completeness, we finally compare the performance of our
scheme with the joint encoding framework based on H.264 with an IPP
coding structure. More precisely, we consider one of the views as the
I-frame (this is the views center, 0, and 3 for the different datasets,
respectively), and the remaining views are encoded as P-frames. We
perform the joint encoding experiments where the motion compensation is
carried out with both a variable block size and a fixed block size of
4x4. The resulting rate-distortion performance is available in Fig. 8
and Fig. 9. The corresponding rate savings with respect to H.264 intra
are available in columns 4 and 5 of Table III. From the plots (see
Figs. 8 and 9) and from Table III, it is clear that our proposed
multi-view reconstruction scheme competes with, and sometimes beats,
the H.264 4x4 scheme at low bit rates; this is consistent with the
tendencies we have observed in the stereo experiments. However, at high
bit rates our scheme performs worse than the H.264 joint coding scheme
due to the suboptimal representation of high-frequency components such
as edges and textures. In contrast to H.264, however, our scheme is
distributed, which reduces the complexity at the encoders and is
attractive for distributed processing applications.

VI. Conclusions 

In this paper, we have proposed a novel rate-balanced distributed
representation scheme for compressing the correlated multi-view images
captured in camera networks. In contrast to classical DSC schemes, our
scheme compresses the images independently without knowing the
inter-view statistical relationship between the images at the encoder.
We have proposed a novel joint decoding algorithm based on a
constrained optimization problem that improves the reconstruction
quality by exploiting the correlation between images. We have shown
that our joint reconstruction problem is convex, so that it can be
efficiently solved using proximal methods. Simulation results confirm
that the proposed joint representation algorithm is successful in
improving the reconstruction quality of the compressed images with a
balanced quality between the images. Furthermore, we have shown by
experiments that the proposed coding scheme outperforms
state-of-the-art distributed coding solutions based on disparity
learning and on DISCOVER. Therefore, our scheme provides an effective
solution for distributed image processing with low encoding complexity,
since it requires neither a Slepian-Wolf encoder nor a feedback
channel. Our future work focuses on developing robust techniques to
estimate more accurate correlation information from highly compressed
images.